Abstract: This paper describes an algorithm for automatically measuring the semantic
similarity between pairs of words in different languages (Bulgarian and Russian).
The algorithm extracts the local contexts of the given words through a series of queries to
the Google search engine and determines the similarity between the words by comparing
their contexts. A dictionary is used to translate the contexts from one language to the other.
A correlation of 71% was measured between the obtained results and the 30 word pairs
of Miller and Charles, which is higher than that of the algorithms known so far.

Keywords: semantic similarity, measuring semantic similarity, local
context, using the web as a corpus, Google.

1. Algorithm for Extracting Semantic Similarity by Searching Google

The algorithm issues queries to the Google search engine and analyses the returned text
excerpts. From them it extracts the so-called local context of each analysed word (the words
in its immediate vicinity), since this context contains words that are semantically related
to it [Hearst, 1991]. From the local contexts extracted for each word a frequency vector is
built, which contains all words from the corresponding local contexts together with their
frequencies of occurrence. The semantic similarity between a pair of words is defined as the
cosine between their frequency vectors in n-dimensional Euclidean space and is a number
between 0 and 1. When the words under consideration are in different languages, their
contexts (which are also in different languages) are compared after one of the contexts has
first been translated into the other language with a dictionary, as described in
[Nakov et al., 2007a]. The algorithm can be used to measure semantic similarity not only
between words but also between phrases.
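
The cosine between two frequency vectors can be sketched as follows. This is only an illustration: the function name `cosine_similarity`, the use of `collections.Counter` as a sparse vector, and the sample counts are assumptions made for the example and do not come from the paper.

```python
import math
from collections import Counter

def cosine_similarity(vec_a: Counter, vec_b: Counter) -> float:
    """Cosine of the angle between two sparse frequency vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(count * vec_b[word] for word, count in vec_a.items() if word in vec_b)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty context is treated as unrelated
    return dot / (norm_a * norm_b)

# Hypothetical frequency vectors of two words (illustrative numbers only).
context_a = Counter({"engine": 4, "road": 2, "driver": 1})
context_b = Counter({"engine": 3, "wheel": 2, "driver": 2})
print(cosine_similarity(context_a, context_b))
```

Because all frequencies are non-negative, the value always falls between 0 and 1, as stated above.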

To extract the local context of a given word from the Internet, we submit a search query for
the word to Google and request that 100 results be returned in the corresponding language
(in our case Bulgarian or Russian). With 10 such queries we extract up to 1000 results
(Google does not allow more to be retrieved). Each result contains a title and a text
excerpt that contain the searched word or one of its word forms.

From the extracted results we first extract all sequences of words. Next we remove all
function words (prepositions, pronouns, conjunctions, particles, interjections and some
adverbs), as well as words with fewer than 3 letters. We then pass through the extracted
sequences of words, look for the given word or one of its word forms, and take the 3 words
before and after it (we call the number 3 the context size). We consider these words part of
the local web context. We replace all extracted words with their base form (we apply
lemmatization), using rich dictionaries of lemmas for Bulgarian and Russian. Finally we
obtain the semantic similarity between a pair of words by computing the cosine between
their frequency vectors in n-dimensional Euclidean space. The result is a number between
0 and 1 that shows how semantically similar the two words are.
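
The extraction steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: each snippet is treated as a single word sequence, the stop-word list and lemma dictionary are tiny hypothetical stand-ins for the real resources, and the names `local_context`, `target_forms` and `lemmas` are invented for the example.

```python
import re
from collections import Counter

def local_context(snippets, target_forms, stop_words, lemmas, size=3):
    """Count the lemmas found within `size` content words of the target word."""
    context = Counter()
    for snippet in snippets:
        # Keep only words of at least 3 letters that are not function words.
        words = [w for w in re.findall(r"\w+", snippet.lower())
                 if len(w) >= 3 and w not in stop_words]
        for i, word in enumerate(words):
            if word not in target_forms:
                continue
            # Take up to `size` words before and after the target occurrence.
            window = words[max(0, i - size):i] + words[i + 1:i + 1 + size]
            # Replace each word by its lemma when one is known (assumed lookup).
            context.update(lemmas.get(w, w) for w in window)
    return context

# Hypothetical English stand-ins for the search result snippets.
snippets = ["The old car engine was in the garage of the driver",
            "A driver parks the cars near the road"]
ctx = local_context(snippets, target_forms={"car", "cars"},
                    stop_words={"the", "was", "near"},
                    lemmas={"cars": "car", "parks": "park"})
print(ctx)
```

The function words are removed before the window is taken, so the 3 words on each side are content words, which matches the order of the steps in the text.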

To measure the semantic similarity between words in different languages we again use the
contexts extracted from Google, but we translate one of them into the other language using
a dictionary of word pairs that are translations of each other. When a word in one
language has several corresponding translations in the other language, each of them is
taken into account with equal weight. Words from either language that have no
corresponding entry in the dictionary are not taken into account.
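
A minimal sketch of the dictionary-based translation of a context is given below. The text only says that all translations count with equal weight; here each translation is assumed to receive the full frequency of the source word, which is one possible reading. The function name `translate_context` and the three-word dictionary are hypothetical.

```python
from collections import Counter

def translate_context(context, dictionary):
    """Map a frequency vector into the other language via a bilingual dictionary."""
    translated = Counter()
    for word, count in context.items():
        # Words without a dictionary entry are skipped (ignored).
        for translation in dictionary.get(word, ()):
            # Assumption: every translation gets the same (full) weight.
            translated[translation] += count
    return translated

# Hypothetical Bulgarian-to-Russian dictionary fragment.
bg_ru = {"кола": ["машина", "автомобиль"], "път": ["дорога"]}
context_bg = Counter({"кола": 3, "път": 2, "небе": 1})
context_ru = translate_context(context_bg, bg_ru)
print(context_ru)
```

In a full implementation the context on the other side would presumably also be restricted to words that appear in the dictionary, so that both vectors live in the same vocabulary.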

1.1. TF.IDF Weighting

In information retrieval, so-called TF.IDF weighting is often applied to the frequencies of
the individual words, which means that more frequently occurring words contribute with
greater weight. This technique is described in detail in [Sparck-Jones, 1972]. We can apply
it as follows: before computing the cosine between the vectors of two given words, we
replace the frequency of each word in the frequency vector with the TF.IDF value computed
for it.
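
The paper does not spell out the exact TF.IDF formula, so the following sketch assumes the common variant tf * log(N / df), where each word's local context plays the role of a document, N is the number of contexts, and df is the number of contexts that contain the word. The name `tfidf_reweight` and the sample data are hypothetical.

```python
import math
from collections import Counter

def tfidf_reweight(contexts):
    """contexts: dict mapping a target word to its Counter of context words."""
    n = len(contexts)
    doc_freq = Counter()
    for vec in contexts.values():
        doc_freq.update(vec.keys())  # each context counts a word at most once
    return {
        target: {w: tf * math.log(n / doc_freq[w]) for w, tf in vec.items()}
        for target, vec in contexts.items()
    }

# Hypothetical contexts of three target words.
contexts = {
    "car": Counter({"engine": 4, "road": 2}),
    "journey": Counter({"road": 3, "ticket": 1}),
    "fruit": Counter({"apple": 5}),
}
weighted = tfidf_reweight(contexts)
print(weighted["car"])
```

Under this assumed formula a word that appears in every context receives weight 0, while a word that is frequent in one context and rare elsewhere receives a high weight.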

1.2. Semantic Similarity Measured Through Reverse Context

When the local context of a given word is extracted from the web, words that are not
semantically related to it often end up in the context. Removing such words from the local
web context should increase the accuracy of the semantic similarity estimate, because the
context will then contain only words that really have a semantic relation to the searched
word [Nakov et al., 2007b].

The use of reverse context is based on the idea that if two words are semantically related,
the first should occur often in the context of the second and, at the same time, the second
should occur often in the context of the first. Thus the frequency of a given word A in the
context of another word B can be computed in two ways: once as the number of occurrences
of A in the context of B, and once as the number of occurrences of B in the context of A.
The smaller of the two values can then be taken.

When computing the vector of mutual occurrences through reverse context, it is advisable
to ignore words that occur too few times (for example fewer than 10), because such
occurrences may be accidental. Changing this parameter (the frequency threshold) can
influence the accuracy of the results.
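
The reverse-context filtering can be sketched as below, assuming the local contexts of the context words themselves have already been extracted and stored in a dictionary. The name `reverse_context`, the data layout and the counts are hypothetical.

```python
from collections import Counter

def reverse_context(target, contexts, threshold=10):
    """Keep context words that also have `target` in their own context."""
    forward = contexts.get(target, Counter())
    filtered = Counter()
    for word, count in forward.items():
        # Occurrences of the target in the context of `word`.
        backward = contexts.get(word, Counter()).get(target, 0)
        mutual = min(count, backward)  # the smaller of the two frequencies
        # Drop words whose mutual frequency is zero or below the threshold.
        if mutual > 0 and mutual >= threshold:
            filtered[word] = mutual
    return filtered

# Hypothetical contexts (illustrative counts only).
contexts = {
    "car": Counter({"engine": 25, "road": 12, "blue": 15}),
    "engine": Counter({"car": 18}),
    "road": Counter({"car": 40}),
    "blue": Counter({"car": 2}),
}
print(reverse_context("car", contexts, threshold=10))
```

Here "blue" is frequent in the context of "car", but "car" is rare in the context of "blue", so the word is discarded as likely noise.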

1.3. Semantic Similarity Through Context Enrichment

Context enrichment means adding to the context of a given word the contexts of all words
that occur frequently in it [Hagiwara et al., 2007]. In this way the context of the word is
extended with further words that were not originally present in it but are related in
meaning to the word. The expectation is that this will improve the accuracy of the
algorithm for measuring the semantic similarity between a pair of words.

When enriching the context, it is advisable to ignore words that occur in it too few times
(for example fewer than 10), because such occurrences may be accidental. Changing this
parameter (the frequency threshold) can influence the accuracy of the results: it sets a
reasonable lower bound on the number of occurrences at which a word's context is used
for enrichment.
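
Context enrichment can be sketched as follows. The text does not state how the added contexts are combined with the original one; this illustration simply sums the frequencies, which is an assumption, as are the name `enrich_context` and the sample counts.

```python
from collections import Counter

def enrich_context(target, contexts, threshold=10):
    """Add the contexts of frequent context words to the context of `target`."""
    base = contexts.get(target, Counter())
    enriched = Counter(base)  # copy, so the original context stays unchanged
    for word, count in base.items():
        # Only words that occur at least `threshold` times trigger enrichment.
        if count >= threshold:
            enriched.update(contexts.get(word, Counter()))  # adds the counts
    return enriched

# Hypothetical contexts (illustrative counts only).
contexts = {
    "car": Counter({"engine": 12, "road": 3}),
    "engine": Counter({"fuel": 7, "car": 20}),
    "road": Counter({"traffic": 9}),
}
print(enrich_context("car", contexts, threshold=10))
```

With a threshold of 10, only the context of "engine" is merged in, so "fuel" enters the context of "car" while "traffic" does not.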

2. Experiments and Results

The experiments we carried out aim to evaluate the proposed algorithms for automatically
extracting cross-lingual semantic similarity from the web by comparing the results they
produce with judgements given by humans.

2.1. Test Data

As test data we use the list of 30 word pairs proposed by Miller and Charles
[Miller & Charles, 1991]. These are carefully selected pairs of nouns; for each pair the
semantic similarity was rated by 51 people on a scale from 0 to 4, and the ratings were
then averaged. We translated the 30 word pairs given by Miller and Charles into Bulgarian
and Russian respectively. In translation we could not always find an exact correspondence
between English, Bulgarian and Russian, and in many places the nuances of the original
words were lost. This could make the human ratings given by Miller and Charles
inaccurate for the translated pairs, but we accept this risk, knowing that we cannot expect
100% accuracy of the results.

2.2. Resources Used

For the purposes of the experiments and for the implementation of the algorithm for
extracting semantic similarity from the web, the following resources were used:

• Grammatical dictionary of Bulgarian and Russian [Paskaleva, 2007]. In its
Bulgarian version the dictionary contains 963 339 word forms and 73 113 lemmas. In
its Russian version the dictionary contains 1 390 613 word forms and 66 101 lemmas.

• List of the function words in Bulgarian and Russian (598 Bulgarian and
507 Russian words: prepositions, pronouns, conjunctions, particles, interjections and adverbs).

• Short Bulgarian-Russian dictionary, containing 4 562 word pairs that are translations
of each other. Compiled from an online Bulgarian-Russian dictionary [BgRu.net, 2007].

• Detailed Bulgarian-Russian dictionary, containing 59 582 pairs of words and phrases that
are translations of each other. Compiled from two large Bulgarian-Russian and
Russian-Bulgarian dictionaries [Чукалов, 1986] and [Бернщайн, 1986].

2.3. Description of the Experiments

On the 30 pairs of words and phrases adapted from Miller and Charles we carried out a series
of experiments to estimate their semantic similarity by running the described
algorithms with different parameter settings (with different dictionary sizes, with and
without TF.IDF, with and without reverse context, with and without context enrichment,
and with different values of the minimum frequency of occurrence of the words):

• RAND - random similarity, assigned to all word pairs.

• SIM - the basic algorithm for extracting semantic similarity from the web.

• SIM-BIG - the basic SIM algorithm with the detailed Bulgarian-Russian dictionary.

• SIM+TFIDF - a modification of the SIM algorithm that uses TF.IDF.

• SIM-BIG+TFIDF - a modification of the SIM algorithm that uses TF.IDF
weighting and the detailed Bulgarian-Russian dictionary.

• REV-0, REV-10, REV-20, REV-30, REV-40, REV-50 - modifications of the SIM
algorithm that use reverse context with thresholds 0, 10, 20, 30, 40 and 50.

• REV-BIG-0, REV-BIG-10, REV-BIG-20, REV-BIG-30, REV-BIG-40, REV-BIG-50 -
a modification of the REV algorithm with the detailed Bulgarian-Russian dictionary.

• IND-10, IND-20, IND-30, IND-40, IND-50 - a modification of the SIM algorithm that
uses enriched context with thresholds 10, 20, 30, 40 and 50.

• IND-BIG-10, IND-BIG-20, IND-BIG-30, IND-BIG-40, IND-BIG-50 - a modification
of the IND algorithm that uses the detailed Bulgarian-Russian dictionary.

2.4. Results

The results obtained from our algorithms for automatically measuring semantic
similarity were compared with the human judgements by computing Pearson's correlation
coefficient (a standard statistical measure of the linear relationship between two
variables). The following results were obtained:



                 Threshold
Algorithm        0        10       20       30       40       50
RAND
SIM              0.7043   -
SIM+TFIDF        0.7010
SIM-BIG          0.6210
SIM-BIG+TFIDF    0.6191
REV              0.5933   0.5732   0.5623   0.5625   0.5623   0.5492
REV-BIG          0.5961   0.5964   0.5956   0.5957   0.5953   0.5920
IND              -        0.5078   0.6027   0.6850   0.6485   0.6445
IND-BIG          -        0.5046   0.6057   0.7149   0.6296   0.6412
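
For reference, Pearson's correlation coefficient between the automatic scores and the human ratings can be computed as in the sketch below. The function name `pearson` is an assumption, and the two short lists are made-up numbers used only to show the call; they are not data from the experiments.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up ratings (scale 0 to 4) and made-up cosine scores for four word pairs.
human = [3.9, 3.1, 1.2, 0.4]
system = [0.8, 0.6, 0.3, 0.1]
print(pearson(human, system))
```

The coefficient ranges from -1 to 1; the percentages quoted in the text correspond to this value multiplied by 100.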



2.5. Analysis of the Results

The table shows that the semantic similarity estimated automatically with the proposed
algorithms has a correlation of between 50% and 71% with the corresponding human
judgements. This correlation is much higher than the 0% obtained with the random
estimate RAND. Although the results of the basic SIM algorithm are quite good, we see that
the attempts to improve it are not very successful. From the results we can draw the
following conclusions:

• Using TF.IDF weighting has a negative effect.

• Using reverse context does not work well: the REV algorithm performs
worse than the basic SIM algorithm at every frequency threshold.

• Context enrichment (the IND algorithm) works slightly better than the
basic SIM algorithm only with a carefully chosen frequency threshold.

• Using the detailed dictionary instead of the short one helps only in some cases.

There are several main reasons for the inaccuracy of the results:

• Loss of nuances in the translation of the 30 word pairs of Miller and Charles.

• Using the web as a corpus restricts the extraction of local context
to only 1000 articles, and these are not a representative sample of all articles.

• Using words rather than phrases when extracting the contexts and later when
translating them introduces a lot of noise. This is a major weakness of the described
algorithms.

• Incompleteness of the translation dictionaries. Words from either language that have no
corresponding entry in the dictionary are not taken into account (they are ignored).

3. Related Work and Comparison

Most known methods for automatically estimating semantic similarity are based
on the linguistic distributional hypothesis [Harris, 1954],
which states that semantically similar words occur in similar contexts. Our approach is
also based on it.

[Weeds, 2003] compares 6 algorithms for extracting semantic similarity that are based on
the distributional hypothesis (and use no additional resources) and finds that the best of
them achieves a Pearson correlation coefficient of 62%. Our best result of 71% is better
than that of the algorithms described by Weeds, and it is obtained on the more difficult
task of measuring cross-lingual semantic similarity.

[Budanitsky & Hirst, 2006] compare 5 different algorithms for estimating semantic
similarity that are based on WordNet. The best of them achieves a Pearson correlation of
85% on the 30 word pairs of Miller and Charles, which is a better result than that of
our algorithms, but it relies on WordNet (which has no Bulgarian or Russian version).